Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.1 - Looking at Your Data")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
   id breed_id nickname             birthday age  color
0   1        1     King  2014-11-22 12:30:31   5  brown
1   2        3    Argus  2016-11-22 10:05:10  10   None
2   3        1   Chewie  2016-11-22 10:05:10  15   None

Looking at Your Data

Spark is lazily evaluated. To look at your data, you must perform a take operation that triggers your transformations to be evaluated. There are a few ways to perform a take operation; we'll go through each of them here, along with their performance characteristics.

For example, toPandas() is a take operation, which you've already seen in many places.
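As a quick sketch of what "lazy" means here (the loud_pets and loud_nickname names below are made up for illustration), defining a transformation on its own does nothing; only the take operation at the end triggers the work:

# the withColumn() below is only recorded in the query plan, not executed yet
loud_pets = pets.withColumn("loud_nickname", F.upper(F.col("nickname")))

# nothing has been computed so far; this take operation triggers the actual work
loud_pets.show()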

Option 1 - collect()

pets.collect()
[Row(id=u'1', breed_id=u'1', nickname=u'King', birthday=u'2014-11-22 12:30:31', age=u'5', color=u'brown'),
 Row(id=u'2', breed_id=u'3', nickname=u'Argus', birthday=u'2016-11-22 10:05:10', age=u'10', color=None),
 Row(id=u'3', breed_id=u'1', nickname=u'Chewie', birthday=u'2016-11-22 10:05:10', age=u'15', color=None)]

What Happened?

When you call collect() on a dataframe, it will trigger a take operation, bring all of the data to the driver node, and then return all rows as a list of Row objects.

Note

This is not advised unless you need to look at every row of your dataset; you should usually sample a subset of the data instead. This call will execute all of the transformations you have specified on all of the data.
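If you only need a peek at the data, one common alternative (a small sketch, not the only way) is to cut the dataframe down with limit() or sample() before collecting:

# bring back only the first couple of rows
pets.limit(2).collect()

# or bring back a random sample (the 50% fraction here is arbitrary)
pets.sample(False, 0.5).collect()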

Option 2 - head()/take()/first()

pets.head(n=1)
[Row(id=u'1', breed_id=u'1', nickname=u'King', birthday=u'2014-11-22 12:30:31', age=u'5', color=u'brown')]

What Happened?

When you call head(n) on a dataframe, it will trigger a take operation and return the first n rows of the resulting dataset. The related operations return different numbers of rows: head(n) and take(n) return a list of the first n Row objects, while first() returns a single Row.
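A quick sketch of the differences:

pets.take(2)     # list of the first 2 Row objects
pets.head(2)     # same as take(2)
pets.first()     # a single Row object, not a list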

Note

  • If the data is unsorted, Spark will run the transformations on only as many partitions as needed to satisfy the requested number of rows. Depending on how large your dataset is, this can be much faster than processing everything.
  • If the data is sorted, Spark performs the same work as a collect() and runs all of the transformations on all of the data.

By sorted we mean that some sorting of the data is done during the transformations, such as sort(), orderBy(), etc.
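For instance, a pipeline like the one below counts as "sorted": the orderBy() means Spark cannot stop after a few partitions and has to process all of the data before head() can return (a small sketch):

# the orderBy() forces a pass over all of the data before the first row can be returned
pets.orderBy(F.col("birthday").desc()).head(1)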

Option 3 - toPandas()

pets.toPandas()
   id breed_id nickname             birthday age  color
0   1        1     King  2014-11-22 12:30:31   5  brown
1   2        3    Argus  2016-11-22 10:05:10  10   None
2   3        1   Chewie  2016-11-22 10:05:10  15   None

What Happened?

When you call toPandas() on a dataframe, it will trigger a take operation, collect all of the rows to the driver, and return them as a pandas DataFrame.

This is as performant as the collect() function, but the most readable in my opinion.
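If you want this readability without pulling the whole dataset to the driver, one option is to limit() the dataframe first:

# only the first couple of rows are converted to a pandas dataframe
pets.limit(2).toPandas()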

Option 4 - show()

pets.show()
+---+--------+--------+-------------------+---+-----+
| id|breed_id|nickname|           birthday|age|color|
+---+--------+--------+-------------------+---+-----+
|  1|       1|    King|2014-11-22 12:30:31|  5|brown|
|  2|       3|   Argus|2016-11-22 10:05:10| 10| null|
|  3|       1|  Chewie|2016-11-22 10:05:10| 15| null|
+---+--------+--------+-------------------+---+-----+

What Happened?

When you call show() on a dataframe, it will trigger a take operation and print up to 20 rows by default.

This is as performant as the head() function and more readable. (I still prefer toPandas() 😀).
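show() also takes a couple of optional arguments, for example n (how many rows to print) and truncate (whether to shorten long column values):

# print only 2 rows and don't truncate long values
pets.show(n=2, truncate=False)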

Summary

  • We learnt about various functions that allow you to look at your data.
  • Some functions are less performant than others, depending on whether the resulting data is sorted or not.
  • Try to refrain from looking at all the data, unless you are required to.
